Noisy Channel Models for Corrupted Chinese Text Restoration and GB-to-Big5 Conversion

Author

Chao-Huang Chang

Abstract

In this article, we propose a noisy channel/information restoration model for error recovery problems in Chinese natural language processing. A language processing system is treated as an information restoration process that operates on the output of a noisy channel. By feeding a large-scale standard corpus C into a simulated noisy channel, we obtain a noisy version of the corpus, N. Using N as the input to the language processing system (i.e., the information restoration process), we obtain the output C'. An automatic evaluation module then compares the original corpus C with the output C' and computes the performance index (i.e., accuracy). The proposed model has been applied to two common and important problems in Chinese NLP for the Internet: corrupted Chinese text restoration and GB-to-Big5 conversion. Sinica Corpus versions 1.0 and 2.0 are used in the experiments. The results show that the proposed model is useful and practical.
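The evaluation loop described in the abstract (C to N to C', then compare C with C') can be sketched roughly as follows. This is a minimal illustration only, not the author's implementation: the abstract does not specify the channel, so the random character-substitution channel here is an assumption, and the names `noisy_channel`, `accuracy`, `evaluate`, and `identity_restore` are all hypothetical.

```python
import random


def noisy_channel(corpus, error_rate, alphabet, seed=0):
    """Simulate a noisy channel (assumed model: random substitution).

    Each character of `corpus` is replaced, with probability `error_rate`,
    by a different character drawn from `alphabet` (needs >= 2 symbols).
    """
    rng = random.Random(seed)
    noisy = []
    for ch in corpus:
        if rng.random() < error_rate:
            noisy.append(rng.choice([a for a in alphabet if a != ch]))
        else:
            noisy.append(ch)
    return "".join(noisy)


def accuracy(original, restored):
    """Character-level accuracy of the restored text C' against C."""
    if len(original) != len(restored):
        raise ValueError("texts must have the same length")
    if not original:
        return 1.0
    matches = sum(a == b for a, b in zip(original, restored))
    return matches / len(original)


def evaluate(corpus, restore, error_rate, alphabet, seed=0):
    """Run C -> N -> C' and score C' against C automatically."""
    noisy = noisy_channel(corpus, error_rate, alphabet, seed)  # N
    restored = restore(noisy)                                  # C'
    return accuracy(corpus, restored)


def identity_restore(noisy):
    """Placeholder restoration process that changes nothing (a baseline)."""
    return noisy
```

Any restoration system can be passed in as `restore`; the identity baseline simply shows how much damage the simulated channel causes before any recovery is attempted.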

Similar resources

HHMM-based Chinese Lexical Analyzer ICTCLAS

This document presents the results from the Inst. of Computing Tech., CAS in the ACL-SIGHAN-sponsored First International Chinese Word Segmentation Bakeoff. The authors introduce the unified HHMM-based framework of their Chinese lexical analyzer ICTCLAS and explain the operation of the six tracks. They then provide the evaluation results and further analysis. Evaluation on ICTCLAS shows that its performance ...

Text Image Restoration using Adaptive Fuzzy Median Based on 3D Tensors and Iterative Voting

This paper addresses the problem of efficient and effective restoration of text images, by formulating the problem as inferring the surface from a sparse and noisy point set in a 3D structure tensor space. Given a set of noisy data correspondence in corrupted images, the proposed method extracts good matches and rejects the noisy elements. The methodology is unconventional, since, unlike most o...

A Unified Approach to Transliteration-based Text Input with Online Spelling Correction

This paper presents an integrated, end-to-end approach to online spelling correction for text input. Online spelling correction refers to the spelling correction as you type, as opposed to post-editing. The online scenario is particularly important for languages that routinely use transliteration-based text input methods, such as Chinese and Japanese, because the desired target characters canno...

The Postprocessing of Optical Character Recognition Based on Statistical Noisy Channel and Language Model

Image processing techniques have been used in optical character recognition (OCR) for a long time. Recognition methods have evolved from early "pattern recognition" to, more recently, "feature extraction". The recognition rate has risen from 70% to 90%. But character-by-character recognition has its limitations. Using language models to assist the OCR system in improving recognition ...

A New Information Retrieval Method Suited to Texts Produced by Speech Recognition

In this article, a pre-processing method is introduced that is applicable to the task of retrieving speech-recognized texts. The inputs are a text corpus generated by a speech recognition system and a query; the goal is to search these documents for the query and find the relevant ones. A basic problem with typical speech-recognized text is that some percentage of it is recognized incorrectly. This, results erroneously ass...

Journal:

IJCLCLP

Volume 3

Publication date: 1998
